Fix autoscaler returning fatal error on GOAWAY if no scaling seen #973
Conversation
```diff
 }

 async fn fake_server<F>(response_maker: F) -> FakeServer
 where
-    F: FnMut() -> Response<Body> + Clone + Send + Sync + 'static,
+    F: FnMut() -> BoxFuture<'static, Response<Body>> + Clone + Send + Sync + 'static,
```
I made this change while trying to get the fake server to produce the GOAWAY. I couldn't manage it, but it's probably useful for the fake responses to be async anyway.
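For illustration, a minimal sketch of a response maker that satisfies the new async bound; the `hyper::Body`/`Response` types and the `.boxed()` helper from `futures_util` are my assumptions about the test harness, not taken from this PR:

```rust
use futures_util::{future::BoxFuture, FutureExt};
use hyper::{Body, Response};

// Under the old bound the closure had to build a Response synchronously.
// Under the new bound it returns a boxed future, so the fake response can
// be produced asynchronously (e.g. after a delay or a channel recv).
fn make_response() -> BoxFuture<'static, Response<Body>> {
    async {
        // Build whatever fake gRPC response the test needs.
        Response::new(Body::empty())
    }
    .boxed()
}
```

Because the bound also requires `Clone`, a plain `fn` item (or a closure capturing only cloneable state) works as the `response_maker`.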
```rust
// Only propagate errors out if they weren't because of the short-circuiting
// logic. IE: We don't want to fail callers because we said we wanted to know
// about ResourceExhausted errors, but we haven't seen a scaling decision yet,
// so we're not reacting to errors, only propagating them.
return !e
    .metadata()
    .contains_key(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT);
```
This is the fix
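To make the mechanism concrete, here's a hedged sketch of how such a metadata marker can be set and then checked on a `tonic::Status`. The key's value and the tagging site are hypothetical; only the final check mirrors the diff above:

```rust
use tonic::{metadata::MetadataValue, Code, Status};

// Hypothetical value for the marker key; the real constant lives in the crate.
const ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT: &str = "short-circuited-error";

// When the short-circuit path surfaces an otherwise-retryable error (so the
// autoscaler can react to it), tag the Status so callers can tell it apart
// from a genuinely fatal error.
fn tag_short_circuited(code: Code, msg: &str) -> Status {
    let mut status = Status::new(code, msg);
    status.metadata_mut().insert(
        ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT,
        MetadataValue::from_static("true"),
    );
    status
}

// The fix: only treat an error as fatal if it was *not* surfaced purely
// because of the short-circuiting logic.
fn should_propagate_as_fatal(e: &Status) -> bool {
    !e.metadata().contains_key(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT)
}
```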
Can I get some background here? I'd expect Tonic to hide GOAWAY and implicitly handle reconnects. We send a GOAWAY every 5m on average. Is there a situation where a regular non-worker client might see a GOAWAY?
There's a bug in hyper or tonic that was worked around here: #811
But the short-circuiting that the autoscaler turns on, so it can scale better on otherwise non-fatal errors, makes this come through as fatal - so it gets ignored specifically here.
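For reference, a hedged sketch of the kind of check this implies; the `Code::Cancelled` / "connection closed" shape comes from the discussion below, not from the actual #811 workaround code:

```rust
use tonic::{Code, Status};

// Tonic/hyper can surface a GOAWAY'd connection as Cancelled with a
// "connection closed"-style message; that shape should be retried
// rather than treated as a fatal poll error.
fn looks_like_goaway(e: &Status) -> bool {
    e.code() == Code::Cancelled && e.message().contains("connection closed")
}
```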
(marking approved anyways, just trying to understand some strangeness here)
So IIUC the server by default sends a GOAWAY after 5m (aka a soft connection close telling you to stop making new calls after whatever is in flight) and then hard-closes the TCP connection after 7m (because 2m between soft and hard is enough time for a properly behaving client to never hit this).
So somehow we're sending RPC calls even after the soft close and therefore hitting the 7m limit? If that is the case, there can be data loss in a rare/racy way if hard TCP death occurs during a gRPC call (in our case it'd be a task timeout, because the server might send us a task and then close the conn). Or maybe Tonic is the one that's eagerly returning a `Code::Cancelled` with "connection closed" before it ever even makes the call? Obviously not important if it all works today, but it is a bit confusing to me.
I think the latter is the explanation, but yeah, I agree it's confusing.
What was changed
Fix autoscaling pollers possibly treating a GOAWAY/connection-closed error as fatal when they shouldn't.
Why?
We don't want workers to die in this situation.
Checklist
Closes [Bug] Temporal Worker RuntimeError: Poll failure: Unhandled grpc error when polling #964
How was this tested:
Added unit tests and verified with a long-running integration test. Unfortunately I could not find any reasonable way to simulate this exact weirdo GOAWAY error with the fake in-memory gRPC server I have, even after tons of research.
Any docs updates needed?